In this project we conducted a spatial analysis of the Kensington and Chelsea district of London (later referred to as K&D). We used data from:
- Nomis site – data from the 2011 Census at OA level (the lowest division). All the data was downloaded as percentages.
- HM Land Registry – data on house sales made in the Kensington and Chelsea district in 2011.
- Helpful code snippets
We chose the Kensington and Chelsea district for our analysis because we consider it one of the most “mixed” districts in London. It is the smallest borough in London and the second smallest district in England, and it is one of the most densely populated administrative regions in the United Kingdom. It also includes affluent areas such as Notting Hill, Kensington, South Kensington, Chelsea, and Knightsbridge. The fact that it contains many of the most expensive residential properties in the world suggests that there may be distinct social inequalities. At the 2011 census, the borough had a population of 158,649, of whom 71% were White, 10% Asian, 5% of multiple ethnic groups, 4% Black African and 3% Black Caribbean. A 2017 study by Trust for London and the New Policy Institute found that Kensington & Chelsea has the greatest income inequality of any London borough. Private rent for low earners was also found to be the least affordable in London. However, the borough’s poverty rate of 28% is roughly in line with the London-wide average.
All these factors convinced us that a more in-depth analysis of the district might lead to some insightful outcomes.
In the further analysis we decided to verify two major hypotheses:
- The employed indicator differs among different groups of OAs. We can explain it using demographic factors both from the OA itself and from neighbouring ones, through spatially lagged variables.
- The mean house price in an OA is affected by its socio-economic structure as well as by neighbouring OAs, so introducing Geographically Weighted Regression improves the prediction accuracy.
First, let us load the previously prepared data:
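library(sp)     # SpatialPointsDataFrame(), CRS(), proj4string(), merge()
library(rgdal)  # readOGR()
library(dplyr)  # %>%, group_by(), summarize(), inner_join()
library(tidyr)  # gather()
Census.Data <- read.csv("census_data.csv")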
houseData <- read.csv("house_data.csv")
House.Points <-SpatialPointsDataFrame(houseData[,6:7], houseData,
proj4string = CRS("+init=EPSG:27700"))
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum OSGB_1936 in CRS definition
hist_df <- gather(Census.Data[,-1], key = "name", value = "value")
Output.Areas <- readOGR("data/statistical-gis-boundaries-london/ESRI", "OA_2011_London_gen_MHW")
## Warning in OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS =
## dumpSRS, : Discarded datum OSGB_1936 in CRS definition: +proj=tmerc +lat_0=49
## +lon_0=-2 +k=0.999601272 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\Asus\GIT\spatial-analysis\data\statistical-gis-boundaries-london\ESRI", layer: "OA_2011_London_gen_MHW"
## with 25053 features
## It has 17 fields
Output.Areas <- Output.Areas[Output.Areas$LAD11NM=="Kensington and Chelsea",]
OA.Census <- merge(Output.Areas, Census.Data, by.y ="OA", by.x="OA11CD")
proj4string(OA.Census) <- CRS("+init=EPSG:27700")
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum OSGB_1936 in CRS definition
## Warning in proj4string(obj): CRS object has comment, which is lost in output
## Warning in `proj4string<-`(`*tmp*`, value = new("CRS", projargs = "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs")): A new CRS was assigned to an object with an existing CRS:
## +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.999601272 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs
## without reprojecting.
## For reprojection, use function spTransform
House.Agg <- houseData %>%
group_by(oa11) %>%
dplyr::summarize(mean_price = mean(price_paid, na.rm=TRUE))
houses_merged <- Census.Data %>%
inner_join(House.Agg, by = c("OA" = "oa11"))
OA.Census.mp <- merge(Output.Areas, houses_merged, by.y ="OA", by.x="OA11CD", all = FALSE)
proj4string(OA.Census.mp) <- CRS("+init=EPSG:27700")
We use histograms to explore the density of each variable, take a brief look at what the dataset looks like and check for potential model variables for the further analysis.
Now, let’s check the distribution of percentages across all OAs for the religion-based variables, the race-based variables and certain other chosen variables. We can see that the majority of inhabitants of K&D are Christian. However, there are many outliers for the Muslim religion, meaning that there could be OAs in K&D with a heavily Muslim population.
Once again, there is one overwhelming group, white people, but there are a few outliers in every group, which again means that we might have some areas that differ greatly from the rest.
Finally, looking at the distribution of the chosen variables, we can see that there are, for example, OAs with a very low employment rate or a rather high percentage of families where only children under the age of 6 have English as their primary language. Considering the previous two plots, we therefore expect that the whole of K&D might be divided into wealthy areas, poor areas, areas mostly occupied by immigrants and areas defined by religion. That is a good starting point for further spatial analysis.
Now let us explore dependencies between some variables that we would expect to be correlated. We can see them on the scatterplots below. Looking at the first one, we can see that there is a negative correlation between the percentage of people with the highest qualification within an OA and the unemployment rate. Additionally, OAs with more black/African people and more people of Muslim faith are mostly the ones with the highest unemployment rate. That indicates that there might be a problem with employment opportunities and education for immigrants.
On the second one, the correlation between the employment rate and the percentage of inhabitants with the highest qualification is rather positive. However, what is interesting is that the OAs with the most qualified inhabitants and the highest employment rate are also the ones with the most white people from European countries other than the United Kingdom. That might raise the suspicion that there is a racial problem for immigrants: it is far easier to get education and employment if you are white. One could argue that many students from across Europe aspire to study or work in London, so this might also create the bias. However, this thesis is not supported by the gathered data.
Correlation plot: the variables that we expected to be correlated with each other matched the expectations. Most of the non-white, non-native OAs are also the ones with the most social rented flats, the highest unemployment rate and the lowest qualifications.
Now we will explore how the percentages of citizens with the highest qualification and of black/African or white race are distributed among OAs. All plots are interactive, so we can also analyze them with reference to the geographical location.
As we previously expected, the OAs with the highest densities of black/African people are also packed with people of Muslim faith. On the contrary, areas with mostly white people are also the ones with the highest rate of highly qualified citizens. We can also see that the OAs in the north of the district (mostly North Kensington) are visibly apart from the center and the south (Chelsea), which is mostly occupied by natives. However, what is truly remarkable is that the OAs with the most people born in the UK are also the ones with the most black/African people. So where do these highly qualified white people residing in the south come from?
Let us explore how prices of houses in K&D compare with distribution of citizens that were born in EU countries:
So, here is the answer: those mostly-white OAs with a high rate of qualification are the areas occupied mostly by EU-born citizens. Prices in these areas are also much higher. We added the distribution of unemployment, both to show that it overlaps with the northern areas identified previously and that it is negatively correlated with prices. We expect that there might not be a direct cause and effect; it is probably more of a vicious cycle. From the density plots we can see that the “richest” OAs are also the areas with the most houses sold (the highest turnover).
In this part we focus on what drives the employment rate as well as the housing prices in the Royal Borough of Kensington and Chelsea.
Firstly, we need to compute the neighbours of the polygons in our dataset, which we are going to use in later parts of the study. Doing this in two different ways results in:
Having the neighbours computed, we can run Moran’s test, which gives a spatial autocorrelation score for our employed variable.
Moran I Test:
##
## Moran I test under randomisation
##
## data: OA.Census$employed
## weights: listw
##
## Moran I statistic standard deviate = 14.574, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.3543126571 -0.0015873016 0.0005963801
The employed indicator has a Moran’s I statistic of 0.35, so it shows a positive spatial autocorrelation, and the test is highly significant (p-value < 2.2e-16). We now have reason to believe that the data does cluster spatially.
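To make the statistic above more concrete, here is a minimal base R sketch of the global Moran's I computed by hand. It is only an illustration under stated assumptions: the 4 x 4 grid, the rook contiguity, the row-standardised weights and the helper `moran_i` are all made up for this example and are not the neighbour structure or the function used in the report.

```r
# Toy example: global Moran's I on a 4 x 4 grid (not the K&D data).
n_side <- 4
coords <- expand.grid(x = 1:n_side, y = 1:n_side)

# Rook contiguity: two cells are neighbours when they share an edge,
# i.e. when their Manhattan distance is exactly 1.
d <- as.matrix(dist(coords, method = "manhattan"))
W <- (d == 1) * 1

# Row-standardise so that the weights of every cell sum to 1.
W <- W / rowSums(W)

# Hypothetical helper: I = (n / S0) * (z' W z) / (z' z), with z = x - mean(x)
# and S0 the sum of all weights.
moran_i <- function(x, W) {
  z  <- x - mean(x)
  s0 <- sum(W)
  as.numeric((length(x) / s0) * (t(z) %*% W %*% z) / sum(z^2))
}

# A smooth west-east gradient is positively autocorrelated,
# a checkerboard pattern is negatively autocorrelated.
gradient <- coords$x
checker  <- (coords$x + coords$y) %% 2

moran_i(gradient, W)
moran_i(checker, W)
```

Values close to the expectation of -1/(n-1) indicate no spatial pattern, which is why the test output above reports the expectation next to the statistic.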
Analysis of the local spatial autocorrelation may result in broader conclusions.
Below, you can observe the local Moran statistic on our map, broken down by OA:
We can clearly see that there are indeed areas that are surrounded by units with similar values, namely those with a positive local Moran statistic. This further confirms that the data cluster spatially. However, this map does not give us specific information about those areas or clusters.
To get insight into each one, we are going to use a LISA cluster map.
We set the level of significance to 0.2 for the sake of a clearer visualization. We can now observe which clusters are of high and which of low values. The northern areas of Kensington have mostly low employment rates, but they border areas of high values (hence the light blue color), while some of the areas of Chelsea with high employment rates border areas of low values (hence the light red), even though they are in the center of Chelsea. Additionally, we can see a few high-high and low-low clusters (strong red/blue color); these are areas bordering others that are similar to them in terms of employment. We can conclude that the data is significantly clustered in terms of employment.
We can also check whether the Getis-Ord Gi statistic can help us with our analysis. Thanks to it we can broaden our analysis with proximity-based neighbours instead of border-based ones.
We set the proximity to 750 meters (which emphasizes the clusters most effectively) and observe hot-spots based on the intensity of clusters on the map below:
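The four cluster types on a LISA map come from comparing each area's value with the average of its neighbours. The base R sketch below illustrates that classification on made-up numbers. The five areas, their values and the weights matrix are assumptions for this example only, and the significance filtering that a real LISA map applies is left out.

```r
# Toy example: LISA-style quadrant labels (not the K&D data).
# Five areas arranged in a line (1-2-3-4-5) with made-up employment rates.
x <- c(40, 45, 70, 75, 42)

# Adjacent areas on the line are neighbours; weights are row-standardised.
adj <- matrix(0, 5, 5)
for (i in 1:4) {
  adj[i, i + 1] <- 1
  adj[i + 1, i] <- 1
}
W <- adj / rowSums(adj)

z     <- x - mean(x)          # deviation of each area from the mean
lag_z <- as.numeric(W %*% z)  # mean deviation of its neighbours

# Quadrant = (own value, neighbours' value) relative to the overall mean.
quadrant <- ifelse(z >= 0 & lag_z >= 0, "high-high",
            ifelse(z <  0 & lag_z <  0, "low-low",
            ifelse(z >= 0 & lag_z <  0, "high-low", "low-high")))

data.frame(x, lag_x = lag_z + mean(x), quadrant)
```

On a real map, only the areas whose local Moran statistic passes the chosen significance level (0.2 in this report) would be coloured; the rest would be shown as not significant.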
## Variable(s) "gi_statistic" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
We can clearly see that there are three main clusters:
- of low values in the north
- of high values in the center
- of slightly high values in the south
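As a rough illustration of how proximity-based neighbours and a hot-spot score fit together, the base R sketch below builds distance-band neighbours at 750 m and computes a Gi*-style z-score. Several things here are assumptions: the regular grid of points, the made-up values, the choice of the Gi* variant (which counts each area as its own neighbour) and the helper `gi_star`, which is not a function from the packages used in the report.

```r
# Toy example: distance-band neighbours and a Gi*-style score (not the K&D data).
# Points on a 5 x 5 grid, 500 m apart, standing in for OA centroids.
coords <- expand.grid(x = seq(0, 2000, by = 500), y = seq(0, 2000, by = 500))
n <- nrow(coords)

# Made-up values: a block of high values in the south-west corner.
value <- ifelse(coords$x <= 500 & coords$y <= 500, 80, 50)

# Binary weights: 1 when two points are at most 750 m apart.
# The diagonal is 1 as well, so each point is its own neighbour (Gi* variant).
d <- as.matrix(dist(coords))
W <- (d <= 750) * 1

# Hypothetical helper implementing the standardised Gi* statistic:
# (sum_j w_ij x_j - xbar * sum_j w_ij) /
#   (s * sqrt((n * sum_j w_ij^2 - (sum_j w_ij)^2) / (n - 1)))
gi_star <- function(x, W) {
  n     <- length(x)
  xbar  <- mean(x)
  s     <- sqrt(sum(x^2) / n - xbar^2)
  wsum  <- rowSums(W)
  w2sum <- rowSums(W^2)
  num   <- as.numeric(W %*% x) - xbar * wsum
  den   <- s * sqrt((n * w2sum - wsum^2) / (n - 1))
  num / den
}

g <- gi_star(value, W)
round(matrix(g, nrow = 5), 2)  # positive = hot spot, negative = cold spot
```

Large positive scores mark points surrounded by high values and negative scores mark points surrounded by low values, which is how the map above should be read.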
Let’s see if we can infer more from regression models and explain the employment rate.
We will start with the OLS model.
Using backward variable selection we eliminated most of the insignificant variables and came up with the following model:
##
## Call:
## lm(formula = OA.Census$employed ~ . - 1, data = OA.Census[, sig_cols_2])
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.3137 -3.2513 0.3911 3.7125 19.6324
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## white 0.31933 0.02140 14.923 < 2e-16 ***
## black_african 0.28316 0.05154 5.494 5.71e-08 ***
## single 0.12605 0.01913 6.588 9.44e-11 ***
## lowest_quali 0.58775 0.10107 5.815 9.65e-09 ***
## highest_quali 0.54895 0.02709 20.267 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.733 on 626 degrees of freedom
## Multiple R-squared: 0.992, Adjusted R-squared: 0.9919
## F-statistic: 1.548e+04 on 5 and 626 DF, p-value: < 2.2e-16
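We obtain an almost ideally fitted model, with R-squared = 0.99. Note, however, that the model is fitted without an intercept (the "- 1" in the formula). In that case R computes R-squared against zero rather than against the mean of the response, so the value is inflated and should not be compared directly with the R-squared of a model that has an intercept. We can further explore the coefficients: all of them are positive, with the qualification levels being the strongest and the indicator that people live alone the weakest. We can also observe that, surprisingly considering the previous data exploration, the white and black_african percentages impact the employed indicator in almost the same way.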
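The effect of dropping the intercept on the reported R-squared can be checked on simulated data. The base R sketch below is only an illustration: the data are made up, and the point is that `summary.lm()` measures a no-intercept model against zero instead of against the mean of the response.

```r
# Toy data (not the census data): the response has a large mean and only
# a weak relationship with the predictor.
set.seed(42)
x <- runif(200, 10, 20)
y <- 50 + 0.5 * x + rnorm(200, sd = 5)

with_intercept <- lm(y ~ x)
no_intercept   <- lm(y ~ x - 1)

# Centred R-squared: 1 - RSS / sum((y - mean(y))^2)
summary(with_intercept)$r.squared

# Uncentred R-squared: 1 - RSS / sum(y^2), much closer to 1
summary(no_intercept)$r.squared

# Centred R-squared recomputed by hand for the no-intercept fit,
# which puts both models on the same scale (it can even be negative).
rss <- sum(residuals(no_intercept)^2)
1 - rss / sum((y - mean(y))^2)
```

Recomputing the centred version in this way for the employment model would give a figure that can be compared with the quasi-global R2 of a GWR fit.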
Let’s see if residuals vary among different OAs.
There are no significant patterns, so we may expect that there are no variables that are unobserved in our analysis. However, spatial analysis using Geographically Weighted Regression (GWR) may bring more information.
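Before looking at the GWR output, it may help to see the idea behind it: a separate weighted regression is fitted at every location, with nearby observations weighted more heavily. The base R sketch below illustrates this on simulated data. It is not the implementation used in the report: the data, the helper `local_slope`, the choice of 20 neighbours and the definition of the adaptive bandwidth as the distance to the k-th nearest neighbour are all assumptions made for this example.

```r
# Toy example: the idea behind GWR on simulated data (not the census data).
set.seed(7)
n <- 100
coords <- cbind(runif(n), runif(n))
x <- rnorm(n)

# Made-up process: the true slope grows from 1 in the west to 3 in the east.
y <- (1 + 2 * coords[, 1]) * x + rnorm(n, sd = 0.1)

# Hypothetical helper: weighted least squares centred on observation i.
local_slope <- function(i, k = 20) {
  # Euclidean distance from observation i to every observation
  d <- sqrt(rowSums((coords - matrix(coords[i, ], n, 2, byrow = TRUE))^2))
  h <- sort(d)[k + 1]         # adaptive bandwidth: distance to the k-th neighbour
  w <- exp(-0.5 * (d / h)^2)  # Gaussian kernel weights
  unname(coef(lm(y ~ x - 1, weights = w))["x"])
}

slopes <- sapply(seq_len(n), local_slope)

coef(lm(y ~ x - 1))        # one global slope, roughly the average
summary(slopes)            # local slopes vary across space
cor(slopes, coords[, 1])   # and follow the west-east trend
```

When the bandwidth grows to cover almost all observations, every local regression uses nearly the same weights and the local coefficients collapse towards the global ones.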
We start by calculating the kernel bandwidth for the GWR computation.
## Call:
## gwr(formula = OA.Census$employed ~ . - 1, data = OA.Census[,
## sig_cols_2], adapt = GWRbandwidth, hatmatrix = TRUE, se.fit = TRUE)
## Kernel function: gwr.Gauss
## Adaptive quantile: 0.1901549 (about 119 of 631 data points)
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu. Median 3rd Qu. Max. Global
## white 0.270580 0.288871 0.299836 0.313213 0.335517 0.3193
## black_african 0.023445 0.195534 0.242522 0.281999 0.351382 0.2832
## single 0.027817 0.052734 0.085185 0.213740 0.313398 0.1260
## lowest_quali 0.207876 0.276701 0.792704 0.904568 1.053125 0.5878
## highest_quali 0.462734 0.496910 0.596113 0.614051 0.651189 0.5490
## Number of data points: 631
## Effective number of parameters (residual: 2traceS - traceS'S): 22.44898
## Effective degrees of freedom (residual: 2traceS - traceS'S): 608.551
## Sigma (residual: 2traceS - traceS'S): 5.564811
## Effective number of parameters (model: traceS): 16.77853
## Effective degrees of freedom (model: traceS): 614.2215
## Sigma (model: traceS): 5.539064
## Sigma (ML): 5.464925
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 3970.666
## AIC (GWR p. 96, eq. 4.22): 3950.797
## Residual sum of squares: 18845.07
## Quasi-global R2: 0.6938792
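The quasi-global R-squared (0.69) seems to be much lower, although it is not directly comparable with the OLS value, which comes from a model without an intercept. Let’s see how it looks broken down by area.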
It seems that the model is best fitted to the areas in the north, where the local R-squared is the highest.
Same as in the linear model, all coefficients are positive. We can further explore how they vary among OAs. The coefficients based on race turned out to impact the north-center areas the most. Similarly for qualifications: the southern part, mainly Chelsea, is impacted in almost the same way by the lowest and the highest qualifications.
Let’s see how the models will predict the mean house prices aggregated by the OA.
We will work through the same methodology as with the employment rate.
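Backward variable selection, used for both regression models in this report, can be sketched with the `step()` function from base R. The report does not say which criterion or tool was used, so the AIC-based elimination shown here, run on the built-in `mtcars` data, is only an assumption about how such a selection might look.

```r
# Toy example on the built-in mtcars data (not the census data).
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination: start from the full model and drop one term at a
# time for as long as that lowers the AIC. trace = 0 silences the log.
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)
length(coef(full)) - length(coef(reduced))  # number of terms dropped
```

Selection by AIC does not guarantee that every remaining variable is significant at the 5% level, so the final summary table is still worth checking.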
Using backward variable selection we eliminated most of the insignificant variables and came up with the following model:
##
## Call:
## lm(formula = OA.Census.mp$mean_price ~ ., data = OA.Census.mp[sig_cols])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2221130 -437500 -119641 227375 6426924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6183339 735000 8.413 3.64e-16 ***
## single -12374 4913 -2.519 0.01207 *
## muslim -17097 7559 -2.262 0.02411 *
## highest_quali 12094 5790 2.089 0.03721 *
## jewish 70316 23129 3.040 0.00248 **
## asian -7480 9048 -0.827 0.40879
## one_car -38962 9201 -4.234 2.70e-05 ***
## no_cars -44399 7206 -6.161 1.42e-09 ***
## Age_30_44 -12502 9360 -1.336 0.18220
## employed -15271 7348 -2.078 0.03817 *
## private_rent 5909 3734 1.582 0.11415
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 860500 on 537 degrees of freedom
## Multiple R-squared: 0.3998, Adjusted R-squared: 0.3886
## F-statistic: 35.76 on 10 and 537 DF, p-value: < 2.2e-16
On this target variable the model is not as well fitted: its R-squared equals 0.40. It is interesting that the percentage of Jews has the most positive impact on the average price, while the percentages of households with one car or no car have the most negative. Note that asian, Age_30_44 and private_rent are not significant at the 5% level. The intercept is estimated at over 6 million. Let’s check the residuals for each OA.
There are no patterns to the residuals, so we may conclude that we did not omit any significant variable in our modelling process.
Maybe GWR model will bring more value.
## Call:
## gwr(formula = OA.Census.mp$mean_price ~ ., data = OA.Census.mp[,
## sig_cols], adapt = GWRbandwidth, hatmatrix = TRUE, se.fit = TRUE)
## Kernel function: gwr.Gauss
## Adaptive quantile: 0.9999285 (about 547 of 548 data points)
## Summary of GWR coefficient estimates at data points:
## Min. 1st Qu. Median 3rd Qu. Max. Global
## X.Intercept. 6088655.0 6120270.2 6166817.8 6319731.2 6344257.2 6183339.3
## single -12875.8 -12827.6 -12476.1 -12242.1 -12106.5 -12373.7
## muslim -18901.7 -18208.7 -17912.6 -17555.4 -17225.3 -17096.8
## highest_quali 11117.2 11480.6 11588.8 11737.1 12170.0 12094.0
## jewish 65184.9 66420.8 67963.5 70085.6 71072.1 70316.2
## asian -7583.5 -7500.8 -7284.7 -7204.3 -7159.6 -7479.5
## one_car -40864.5 -40397.0 -38314.1 -37765.0 -37536.6 -38961.7
## no_cars -45261.9 -44957.5 -44167.7 -43936.7 -43719.8 -44399.1
## Age_30_44 -14652.5 -13714.0 -13312.8 -12977.9 -12410.5 -12501.7
## employed -15235.4 -14847.0 -14650.6 -14381.1 -13797.6 -15271.2
## private_rent 5505.2 5679.7 6169.0 6293.5 6343.6 5908.7
## Number of data points: 548
## Effective number of parameters (residual: 2traceS - traceS'S): 13.47969
## Effective degrees of freedom (residual: 2traceS - traceS'S): 534.5203
## Sigma (residual: 2traceS - traceS'S): 860638.4
## Effective number of parameters (model: traceS): 12.30621
## Effective degrees of freedom (model: traceS): 535.6938
## Sigma (model: traceS): 859695.2
## Sigma (ML): 849987.5
## AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 16546.15
## AIC (GWR p. 96, eq. 4.22): 16531.13
## Residual sum of squares: 3.959183e+14
## Quasi-global R2: 0.402407
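This time the quasi-global R-squared is only marginally higher than in the linear model (0.402 against 0.400). The selected adaptive bandwidth covers almost all data points (about 547 of 548), so the GWR fit stays very close to the global one. The local R-squared can be observed on the plot below: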
GWR clearly fits north-west part better and gradually lowers the quality of fit going further east.
Let’s further explore the results plotting each variable and its coefficients.
- Having one car or not having a car at all impacts the northern areas in almost the same way.
- The density of Jews in the south is clearly reflected in its coefficient values, which raise prices in the southern area.
- On the other hand, OAs in the north with a higher density of Asians are mostly negatively impacted, similarly to the density of single people.
- The more Muslims in the center of Chelsea, the lower the house prices; the same holds for the density of people aged between 30 and 44.
- The density of private rent in the south positively impacts the price.
- It is interesting that in the north the highest qualifications impact the price more than in the center and south.
We are now going to try a different technique, called interpolation. Because we lack data in some of the OAs, we will use this method to “fill in the gaps”. For the first method we need to create Thiessen polygons, which partition the district into areas similar to our OAs by assigning every location to its closest housing point. We then clip those polygons to the area of Kensington and Chelsea and finally plot them on the map, filling the areas with price levels.
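The rule behind Thiessen polygons is simply "take the value of the nearest point". The base R sketch below applies that rule to a coarse grid instead of building actual polygons. The three sale points, their prices and the grid are made up for this example.

```r
# Toy example: nearest-point assignment, the rule behind Thiessen polygons.
# Made-up sale points (coordinates in metres, prices in GBP).
sales <- data.frame(x = c(100, 400, 250),
                    y = c(100, 150, 400),
                    price = c(500000, 1200000, 800000))

# Locations where we want a price estimate.
grid <- expand.grid(x = seq(0, 500, by = 100), y = seq(0, 500, by = 100))

# For each grid location, find the index of the closest sale point.
nearest <- apply(grid, 1, function(p) {
  which.min((sales$x - p["x"])^2 + (sales$y - p["y"])^2)
})

# Every location inherits the price of its nearest sale.
grid$price <- sales$price[nearest]

table(grid$price)  # how many locations fall into each "polygon"
```

The result is a step surface: the price is constant inside each polygon and jumps at the boundaries, which is the main difference from the smoother IDW surface.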
Looking at the arrangement of the polygons, we can see that there are indeed groups of close neighbours in the south and center-west, while the polygons in the north are much further apart. Of course, this reflects only the 2011 sales data that we used.
Next, we will use IDW (inverse distance weighting). We convert the point data of house prices into numerical values spread over a continuous surface, mostly for easier and more approachable visualisations of how the data is distributed across space. This is also a method of interpolation, just a different one from Thiessen polygons.
## [inverse distance weighted interpolation]
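The principle of inverse distance weighting can be shown in a few lines of base R: the estimate at a location is a weighted mean of the known prices, with weights that shrink as the distance grows. The helper `idw_predict`, the two sale points and the power of 2 are assumptions for this example and not the exact settings used in the report.

```r
# Toy example: inverse distance weighting by hand (not the K&D data).
# Two made-up sale points on a line, with made-up prices.
pts <- data.frame(x = c(0, 10), y = c(0, 0), price = c(100, 300))

# Hypothetical helper: weighted mean of the prices with weights 1 / d^power.
idw_predict <- function(x0, y0, pts, power = 2) {
  d <- sqrt((pts$x - x0)^2 + (pts$y - y0)^2)
  # At an observed location, return the observed price (avoids dividing by 0).
  if (any(d == 0)) return(pts$price[which(d == 0)[1]])
  w <- 1 / d^power
  sum(w * pts$price) / sum(w)
}

idw_predict(5, 0, pts)   # halfway: both points weigh the same
idw_predict(2, 0, pts)   # closer to the first point, so closer to its price
idw_predict(0, 0, pts)   # exactly on the first point
```

A higher power makes the surface follow the nearest points more closely, while a lower power gives a smoother surface.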
The 3D visualisation (especially the interactive version, which cannot be seen in Markdown, but the code for it is provided) further supports our previous remarks.
There were some insights that we did not expect to come upon. For example, we concluded that the division might be more about wealth than about race: although there are areas in which black/African people constitute a larger percentage of citizens, these are the same areas that have a majority of native British citizens. However, there are very distinct areas dominated by people from the EU with the highest qualifications, resulting in the lowest unemployment rates. Moreover, immigrants, people of a race other than white and people of a religion other than Christian tend to cluster into their own OAs, completely separate from the regions that are populated mostly by white, rather rich and highly educated citizens (which are less concentrated and have no marked boundary).
Considering our hypotheses, we confirmed that demographic factors do indeed affect the employment rate in given OAs. However, during the analysis the addition of spatially lagged variables did not improve the predictive ability of the models. The mean prices of houses in the OAs, on the other hand, depend on spatial factors as well as on variables from the same OA: the GWR model gave slightly better results than simple linear regression (quasi-global R-squared of 0.402 against 0.400).
We acknowledge the shortcomings of our work. For example, we only analyzed prices of houses that were actually sold. That is, however, the only data accessible in a well-structured format; we would have liked to analyze prices of listed houses, but that would require a lot of scraping from sources we are not familiar with. Of course, we would also like to include other districts in our research, for instance to compare districts less diverse than K&D and see whether there are major differences in, e.g., house price patterns.